CSaRUS-CNN at AMIA-2017 Tasks 1, 2: Under Sampled CNN for Text Classification

Authors

Arjun Magge

Matthew Scotch

Graciela Gonzalez

Abstract

Most practical text classification tasks in natural language processing involve training sets in which the number of training instances belonging to each class is not equal. The performance of the classifier in such a case can be affected by the sampling strategies used in training. In this work, we describe cost-sensitive and random undersampling variants of convolutional neural networks (CNNs) for classifying texts in imbalanced datasets and analyze their results. The classifier proposed in this paper achieves a maximum F1-score of 0.414, placing 2nd on the ADR dataset, and a maximum F1-score of 0.652, placing 6th on the medication intake dataset.

Introduction

Text classification tasks in natural language processing (NLP) can involve datasets in which the number of training instances belonging to each class is not equal. The annotation cost for gold standard labels (i.e. labels assigned by humans) is mostly associated with the number of instances labeled in the dataset. When classes are highly imbalanced, the dataset may not contain enough training examples belonging to the minority class. Having very few instances of the minority classes can lead to a drop in classification performance, because many classifiers tend to assign the majority class to any given instance. Learning from imbalanced data is a well-known problem that has been studied extensively [1]. Standard approaches to classifying imbalanced datasets in NLP at the data level involve oversampling and undersampling [2]. The most common method of tackling this problem at the algorithmic level is cost-sensitive learning [3]. In this work, we chose to experiment with the undersampling and cost-sensitive learning methods in CNN architectures to compensate for class imbalance.

For Task 1, the classification dataset contains tweets mentioning a drug, and the objective is to classify each tweet into one of two classes: 1) No-ADR: the tweet contains no evidence to indicate an adverse drug reaction (ADR), and 2) ADR: the tweet contains evidence to indicate an ADR. Detecting ADRs from social media and health forum texts has been an intensive area of research for early detection of ADRs and possible interventions [4–8].

For Task 2, the dataset contains tweets mentioning a drug, and the objective is to classify each tweet into one of three classes: 1) Intake: the tweet contains evidence of medication intake, 2) Possible-Intake: the tweet contains evidence to suspect medication intake, and 3) No-Intake: the tweet contains no evidence of medication intake. For additional information about the dataset and its annotations, see Klein et al. [9].

Method

Input

The datasets for the tasks contained tweet IDs and their respective categorical annotations. The first set of annotations provided for each task was used as the training dataset and the second set was used as the development/validation set. The original texts were available for only about 40% of the annotations for Task 1 and 60% for Task 2. Table 1 shows the details of the datasets for both tasks and their respective class distributions.

Classifier

For the CNN classifier used in this paper, we implemented our models based on the original CNN architecture proposed by Kim for sentence classification [10]. We use this CNN architecture to construct cost-sensitive and random undersampling variants to tackle the class imbalance problem. The random undersampling variant (Undersampling-CNN) is constructed by randomly sampling an equal number of instances from each class in each epoch. This means that there are far fewer training instances in each epoch. The cost-sensitive variant ...

Table 1: Dataset details for the ADR dataset and medication intake dataset.

Category                   Annotated   Available   Class-1   Class-2   Class-3
Task-1 Training Set        10,822      4966        4407      559       -
Task-1 Development Set     4845        2178        2024      154       -
Task-2 Training Set        8000        5244        1006      1611      2627
Task-2 Development Set     226...
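The per-epoch random undersampling described for Undersampling-CNN can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `undersample_epoch`, the fixed random seed, and the placeholder training call are assumptions made for illustration. Only the class counts (4407 and 559 available Task-1 training tweets) come from Table 1. The sketch redraws a balanced subset at the start of every epoch, so each epoch sees 2 × 559 = 1118 instances instead of all 4966.

```python
import random
from collections import defaultdict


def undersample_epoch(labels, rng):
    """Return shuffled indices with an equal number of instances per class.

    Hypothetical helper: every class contributes as many instances as the
    smallest class has, drawn at random without replacement.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # The smallest class determines how many instances each class contributes.
    n_per_class = min(len(idxs) for idxs in by_class.values())

    epoch = []
    for idxs in by_class.values():
        epoch.extend(rng.sample(idxs, n_per_class))
    rng.shuffle(epoch)
    return epoch


# Available Task-1 training tweets per class, taken from Table 1.
labels = ["Class-1"] * 4407 + ["Class-2"] * 559

rng = random.Random(0)  # fixed seed is an assumption, used only for repeatability
for epoch_number in range(3):
    # A fresh balanced sample is drawn in every epoch, so over many epochs
    # the model still sees most of the majority-class tweets.
    epoch_indices = undersample_epoch(labels, rng)
    # A real training step (e.g. a CNN update) would consume epoch_indices here.
    print(epoch_number, len(epoch_indices))  # 1118 indices per epoch
```

Because the subset is resampled each epoch rather than fixed once, the majority class is not permanently discarded; the trade-off is that each individual epoch is much smaller than the full training set.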

Similar resources

Learning Document Image Features With SqueezeNet Convolutional Neural Network

The classification of various document images is considered an important step towards building a modern digital library or office automation system. Convolutional Neural Network (CNN) classifiers trained with backpropagation are considered to be the current state of the art model for this task. However, there are two major drawbacks for these classifiers: the huge computational power demand for...

BioTxtM 2016 Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining

Methods based on deep learning approaches have recently achieved state-of-the-art performance in a range of machine learning tasks and are increasingly applied to natural language processing (NLP). Despite strong results in various established NLP tasks involving general domain texts, there is only limited work applying these models to biomedical NLP. In this paper, we consider a Convolutional ...

Learning Cognitive Features from Gaze Data for Sentiment and Sarcasm Classification using Convolutional Neural Network

Cognitive NLP systems, i.e., NLP systems that make use of behavioral data, augment traditional text-based features with cognitive features extracted from eye-movement patterns, EEG signals, brain imaging, etc. Such extraction of features is typically manual. We contend that manual extraction of features may not be the best way to tackle text subtleties that characteristically prevail in complex cl...

Team UKNLP: Detecting ADRs, Classifying Medication Intake Messages, and Normalizing ADR Mentions on Twitter

This paper describes the systems we developed for all three tasks of the 2nd Social Media Mining for Health Applications Shared Task at AMIA 2017. The first task focuses on identifying the Twitter posts containing mentions of adverse drug reactions (ADR). The second task focuses on automatic classification of medication intake messages (among those containing drug names) on Twitter. The last ta...

Prostate segmentation and lesions classification in CT images using Mask R-CNN

Purpose: Non-cancerous prostate lesions such as prostate calcification, prostate enlargement, and prostate inflammation cause many problems for men’s health. This research proposes a novel approach that combines image processing techniques and deep learning methods for classification and segmentation of the prostate in CT-scan images by considering the experienced physicians’ reports. ...

Publication date: 2017
